Bayesian and frequentist approaches to sequential monitoring for futility in oncology basket trials: A comparison of Simon’s two-stage design and Bayesian predictive probability monitoring with information sharing across baskets

This article discusses and compares statistical designs for basket trials from both frequentist and Bayesian perspectives. Basket trials are used in oncology to study interventions that are developed to target a specific feature (often a genetic alteration or immune phenotype) that is observed across multiple tissue types and/or tumor histologies. Patient heterogeneity has become pivotal to the development of non-cytotoxic treatment strategies. Treatment targets are often rare and exist among several histologies, making prospective clinical inquiry challenging for individual tumor types. More generally, basket trials are a type of master protocol often used for label expansion. The term master protocol refers to designs that accommodate multiple targets, multiple treatments, or both within one overarching protocol. For the purpose of making sequential decisions about treatment futility, Simon’s two-stage design is often embedded within master protocols. In basket trials, this frequentist design is often applied to independent evaluations of tumor histologies and/or indications. In the tumor-agnostic setting, rarer indications may fail to reach the sample size needed for even the first evaluation for futility. With recent innovations in Bayesian methods, it is possible to evaluate for futility with smaller sample sizes, even for rarer indications. Novel Bayesian methodology for a sequential basket trial design based on predictive probability is introduced. The Bayesian predictive probability designs allow interim analyses with any desired frequency, including continual assessments after each patient observed. The sequential design is compared with and without Bayesian methods for sharing information among a collection of discrete and potentially non-exchangeable tumor types. Bayesian designs are compared with Simon’s two-stage minimax design.

approaches [2]. In this paper we focus on an empirically Bayesian prior that is detailed in Hobbs and Landin, which identifies the maximum a posteriori (MAP) multi-source exchangeability model when evaluating all baskets simultaneously [3]. In the context of the presented simulations, B = 0 was selected to represent a Bayesian model with predictive probability monitoring for futility without information sharing. For MEMs where information sharing is implemented, the hyperparameter B = 0.1 is selected based on prior work that demonstrates a more conservative approach to information sharing that still induces improvements in trial operating characteristics [2,4,5]. In practice, higher hyperparameter values for B to allow more liberal information sharing or fully Bayesian priors of MEMs can be implemented and calibrated to achieve the desired type I error rates.
Posterior inference with MEMs is based on a weighted average over all candidate models, with the posterior model weights calculated by Bayesian model averaging. Similarly, predictive probability monitoring at an interim stage of the trial, as described in the Technical Appendix of the main manuscript, can be applied to the posterior averaged across all MEMs. In our implementation, the posterior weights for exchangeability of each configuration are estimated from the interim data. Then 5000 possible future sets of results are simulated, and the future posterior probability for each simulated set is calculated using the same posterior weights for each MEM. The PP is calculated from these simulated future posterior probabilities. As in the Technical Appendix, if the PP is less than the selected threshold, φ, the basket terminates for futility and ceases enrollment, whereas other baskets can continue enrolling.
Further details about the general structure of MEMs and the asymmetric implementation for Gaussian outcomes can be found in Sections 2 and 3 of Kaizer, Koopmeiners, and Hobbs [1]. For binary outcomes, additional details can be found in Kaizer, Hobbs, and Koopmeiners, with the general structure in Sections 3.1 and 3.2 and prior specification considerations in Section 3.3 [2]. For more specific details on the symmetric context, Section 2.2 of the Hobbs and Landin paper introduces the general MEMs specification, Section 2.3 discusses how to estimate posterior probabilities and the effective sample size, and Section 2.4 presents the application of MEMs in the context of basket trials and the estimation of basketwise exchangeability [3].

Simon's Basket Trial Design with Information Sharing
Simon et al. proposed a basket trial design for binary outcomes that implements futility monitoring via posterior probabilities instead of the predictive probability, while facilitating information sharing across baskets through a Bayesian hierarchical model [6]. One advantage of using posterior probabilities is that they can be more computationally efficient to calculate than predictive probabilities. However, this comes at the expense of not accounting for the uncertainty of the future trial enrollment and what may be concluded if the trial reached full enrollment. We briefly describe the general approach of Simon et al. below, and note that the authors have an interactive online Shiny app and have made R code available via GitHub.
Briefly, they propose a model with prior probabilities specified for two hypotheses: (1) that all baskets are homogeneous (i.e., exchangeable) and (2) that the treatment is active in any given basket. In their original work, examples used a prior probability of 0.5 for homogeneous response and a prior probability of 0.33 for activity in any given basket. However, these priors can be set based on a given context. We detail our specific assumptions in the simulation design section below.

Bayesian Posterior Predictive Probability Calculations
This section details the Bayesian posterior predictive probability models used to formulate decision rules for monitoring futility during enrollment and for deciding whether sufficient "improvement" is evident once the maximum sample size is achieved. The models are presented in the absence of information sharing, where closed-form solutions are available.
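For a basket trial with $J$ baskets ($j = 1, \ldots, J$), let $Y_{i,j} = 1$ if the $i$th participant in the $j$th basket experiences a success and $Y_{i,j} = 0$ if the treatment fails to meet the efficacy outcome. Let $\pi_j$ represent the response rate for the $j$th basket. With a Beta($\alpha_j, \beta_j$) prior, the posterior for $\pi_j$ after observing $r_j$ responses from $n_j$ patients follows the beta distribution

$$f(\pi_j \mid r_j, n_j) = \frac{\pi_j^{\alpha_j + r_j - 1} (1 - \pi_j)^{\beta_j + n_j - r_j - 1}}{\beta(\alpha_j + r_j, \beta_j + n_j - r_j)},$$

where $\beta(\cdot, \cdot)$ represents the beta function and $0 \le \pi_j \le 1$.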
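Let $N_{max}$ denote the planned maximum total number of patients that may enroll into each basket. Then $N_j$ denotes a random variable counting the total number of patients enrolled among $N_{max}$ in the $j$th basket. Let $R_j$ denote a random variable counting the number of responders among $N_{max}$ patients treated with the $j$th therapy. The predictive probability of observing $s_j$ successes (responses) among $N_{max} - n_j$ future patients can be expressed as a product of gamma functions,

$$\Pr(s_j \mid r_j, n_j) = \frac{\Gamma(N_{max} - n_j + 1)}{\Gamma(s_j + 1)\,\Gamma(N_{max} - n_j - s_j + 1)} \cdot \frac{\Gamma(\alpha_j + \beta_j + n_j)}{\Gamma(\alpha_j + r_j)\,\Gamma(\beta_j + n_j - r_j)} \cdot \frac{\Gamma(\alpha_j + r_j + s_j)\,\Gamma(\beta_j + N_{max} - r_j - s_j)}{\Gamma(\alpha_j + \beta_j + N_{max})}.$$

Mathematically, the binary decision pertaining to whether sufficient improvement was evident is an evaluation of the posterior probability

$$\Pr(\pi_j > \pi_0 \mid R_j, N_j = N_{max}) > \theta,$$

where $\pi_0$ represents the assumed response rate under the null. The posterior threshold $\theta \in (0, 1)$ controls the amount of "evidence" required to conclude success. Using properties of the beta distribution, this posterior probability follows as

$$\Pr(\pi_j > \pi_0 \mid R_j, N_j = N_{max}) = 1 - \int_0^{\pi_0} \frac{\pi^{\alpha_j + R_j - 1} (1 - \pi)^{\beta_j + N_{max} - R_j - 1}}{\beta(\alpha_j + R_j, \beta_j + N_{max} - R_j)} \, d\pi.$$

In the presence of partial enrollment, $n_j < N_{max}$, the predictive probability (PP) that the trial ultimately demonstrates improvement for treatment in basket $j$ follows as

$$\lambda(r_j) = \sum_{s_j = 0}^{N_{max} - n_j} \Pr(s_j \mid r_j, n_j) \, I\{\Pr(\pi_j > \pi_0 \mid R_j = r_j + s_j, N_j = N_{max}) > \theta\},$$

where $I\{\cdot\}$ represents the indicator function. The decision to terminate enrollment in the $j$th basket after observing $n_j$ patients follows from evaluating the PP of eventual success:

$\lambda(r_j) < \varphi$: terminate enrollment to the $j$th basket for futility,
$\lambda(r_j) \ge \varphi$: continue enrolling patients to the $j$th basket,

for a given threshold $\varphi \in (0, 1)$.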
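The predictive probability calculation described in this section can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`beta_cdf`, `posterior_prob`, `predictive_pmf`, `predictive_prob`, `continue_enrolling`) are hypothetical, the beta distribution function is evaluated with a standard hypergeometric series that converges quickly for small arguments such as a null response rate, and the thresholds in the usage lines (0.9 for the posterior threshold, 0.1 for the PP threshold) are placeholders rather than calibrated values.

```python
from math import comb, exp, lgamma, log


def beta_cdf(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta I_x(a, b) for 0 < x < 1, using the series
    I_x(a, b) = x^a (1-x)^b / (a B(a, b)) * sum_n [(a+b)_n / (a+1)_n] x^n."""
    log_front = (a * log(x) + b * log(1.0 - x) - log(a)
                 - (lgamma(a) + lgamma(b) - lgamma(a + b)))
    total, term, n = 0.0, 1.0, 0
    while term > 1e-16 * max(total, 1e-300) and n < 100_000:
        total += term
        term *= (a + b + n) / (a + 1.0 + n) * x
        n += 1
    return exp(log_front) * total


def posterior_prob(r: int, n: int, pi0: float,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Pr(pi_j > pi0 | r responses among n patients) under a Beta(alpha, beta) prior."""
    return 1.0 - beta_cdf(pi0, alpha + r, beta + n - r)


def predictive_pmf(s: int, m: int, a: float, b: float) -> float:
    """Beta-binomial probability of s responses among m future patients,
    given a Beta(a, b) posterior at the interim look."""
    log_ratio = (lgamma(a + s) + lgamma(b + m - s) - lgamma(a + b + m)
                 - (lgamma(a) + lgamma(b) - lgamma(a + b)))
    return comb(m, s) * exp(log_ratio)


def predictive_prob(r: int, n: int, n_max: int, pi0: float, theta: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """PP of eventual success: sum the predictive probabilities of every future
    outcome whose final posterior probability would exceed theta."""
    a, b = alpha + r, beta + n - r
    m = n_max - n
    return sum(predictive_pmf(s, m, a, b) for s in range(m + 1)
               if posterior_prob(r + s, n_max, pi0, alpha, beta) > theta)


def continue_enrolling(r: int, n: int, n_max: int, pi0: float,
                       theta: float, phi: float) -> bool:
    """Futility rule: continue only if the PP is at least phi."""
    return predictive_prob(r, n, n_max, pi0, theta) >= phi


# Illustrative use: 1 response among the first 10 of 25 patients, null rate 0.1.
# theta = 0.9 and phi = 0.1 are placeholder thresholds (assumptions).
pp = predictive_prob(r=1, n=10, n_max=25, pi0=0.1, theta=0.9)
print(f"PP = {pp:.3f}; continue = {continue_enrolling(1, 10, 25, 0.1, 0.9, 0.1)}")
```

Because the final decision depends on the data only through the total number of responses, the sum runs over at most $N_{max} - n_j + 1$ outcomes, so the PP can be evaluated after every participant at negligible cost.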

Simulation Design
In our simulation studies we assume the null response rate is $p_0 = 0.1$ and the target response rate is $p_1 = 0.3$ for a basket trial enrolling across 10 baskets. Equal accrual is assumed across baskets throughout the duration of the simulated trials. A maximum sample size of 25 was selected to achieve 90% power and a 10% basket-wise type I error rate for Simon's minimax two-stage design [7]. This design is compared with two Bayesian designs with predictive probability (PP) monitoring that are described below.
Simon's minimax two-stage design within each basket enrolls a maximum sample size of 25 with the only interim analysis after 16 participants. If 0 or 1 responses to treatment are observed at the interim analysis, the basket would terminate for futility. Otherwise, if the basket continues enrolling to a maximum of 25 participants, the decision to recommend further studies occurs if 5 or more responses are observed.
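The operating characteristics of this two-stage rule can be checked by exact binomial enumeration. The sketch below is illustrative only: the function names `binom_pmf` and `simon_two_stage` are hypothetical, and the boundaries are taken directly from the description above (stop if at most 1 response among the first 16 participants, recommend further study if at least 5 responses among 25).

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k responses among n patients."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def simon_two_stage(p: float, n1: int = 16, r1: int = 1,
                    n: int = 25, r: int = 4) -> dict:
    """Exact operating characteristics of a two-stage design at true rate p.

    Stage 1 stops for futility if responses <= r1 among n1 patients.
    Otherwise the basket enrolls to n and efficacy is concluded if the
    total number of responses exceeds r.
    """
    n2 = n - n1
    # Probability of early termination at the interim analysis.
    pet = sum(binom_pmf(x, n1, p) for x in range(r1 + 1))
    # Probability of continuing past stage 1 and exceeding r in total.
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        needed = max(r + 1 - x1, 0)
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(needed, n2 + 1))
        reject += binom_pmf(x1, n1, p) * tail
    expected_n = n1 + (1.0 - pet) * n2
    return {"pet": pet, "reject": reject, "expected_n": expected_n}


null = simon_two_stage(0.1)  # rejection probability here is the type I error
alt = simon_two_stage(0.3)   # rejection probability here is the power
print(null)
print(alt)
```

Evaluating the function at the null rate gives the basket-wise type I error rate, the probability of early termination, and the expected sample size. Evaluating it at the target rate gives the power.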
We consider three Bayesian designs in our simulation study: one without information sharing (i.e., B = 0), one with some information sharing using MEMs with B = 0.1, and an approach by Simon, et al., implementing information sharing through a hierarchical model [6]. Details on the priors and hyperparameter choices are given in the following paragraphs.
For the Bayesian designs without information sharing (i.e., B = 0) and with information sharing using MEMs (i.e., B = 0.1), a Beta(0.5, 0.5) prior is placed on the response rate. The selection of B = 0.1 is based on prior studies where it has performed well with respect to the trade-off of sharing information while still offering improvements in power or reduced type I error rates in certain scenarios of interest. However, we note more generally that hyperparameters such as B should be calibrated to each context in practice [5].
For Simon's design with posterior probability monitoring for futility, a prior on homogeneous response is set at 0.1 to reflect the hyperparameter assumed for MEMs. The prior for any basket showing efficacy is set at 0.33 based on their original paper.
Interim monitoring for futility is commenced after the 5th participant is observed and continues after each additional participant until the basket terminates for futility or the maximum sample size of 25 is enrolled. For designs with information sharing, terminated baskets are still considered in the evaluation of exchangeability even though they are no longer enrolling as they still contribute valuable information on treatment response. In all Bayesian designs the effect of PP thresholds is examined across a range of values from 0 to 0.5 by increments of 0.05. This reflects a range without stopping for futility (i.e., a fixed sample design with a PP threshold of 0) to thresholds with more aggressive termination for futility.

Posterior Probability Calibration
The posterior probability for efficacy is calibrated to achieve a 10% basket-wise type I error rate. Calibration in all cases was based on simulating 1,000 fixed sample trials for the global null (all 10 baskets with a 10% response) and estimating the posterior probabilities for each basket in the simulated trial. Then, the threshold was identified so that a type I error rate of 10% for a given setting (basket-wise or family-wise control) was maintained. The calculated threshold was used for the mixed scenarios corresponding to their respective basket-wise or family-wise simulations.
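A simulation-based calibration of this kind can be sketched as follows for the design without information sharing. This is a hedged illustration rather than the procedure used to produce the reported results: the function names, the random seed, and the rule of picking the smallest observed posterior probability that keeps the empirical error rate at or below the target are all assumptions, and a Beta(0.5, 0.5) prior is assumed for each basket.

```python
import random
from math import exp, lgamma, log


def beta_cdf(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta I_x(a, b) for 0 < x < 1 (series form)."""
    log_front = (a * log(x) + b * log(1.0 - x) - log(a)
                 - (lgamma(a) + lgamma(b) - lgamma(a + b)))
    total, term, n = 0.0, 1.0, 0
    while term > 1e-16 * max(total, 1e-300) and n < 100_000:
        total += term
        term *= (a + b + n) / (a + 1.0 + n) * x
        n += 1
    return exp(log_front) * total


def simulate_null_posteriors(n_trials: int = 1000, n_baskets: int = 10,
                             n: int = 25, p0: float = 0.1,
                             seed: int = 1) -> list:
    """Simulate fixed-sample trials under the global null and return, per trial,
    the posterior probability Pr(pi_j > p0 | data) for each basket."""
    rng = random.Random(seed)
    # Posterior probability depends only on the response count, so tabulate it.
    lookup = [1.0 - beta_cdf(p0, 0.5 + r, 0.5 + n - r) for r in range(n + 1)]
    trials = []
    for _ in range(n_trials):
        counts = [sum(rng.random() < p0 for _ in range(n))
                  for _ in range(n_baskets)]
        trials.append([lookup[r] for r in counts])
    return trials


def calibrate(trials: list, target: float = 0.10,
              familywise: bool = False) -> float:
    """Smallest observed cut-off such that the empirical rate of
    'posterior probability > cut-off' does not exceed the target.
    Basket-wise: rate over all baskets. Family-wise: rate over trials
    in which at least one basket exceeds the cut-off."""
    if familywise:
        stats = [max(trial) for trial in trials]
    else:
        stats = [p for trial in trials for p in trial]
    for cut in sorted(set(stats)):
        if sum(s > cut for s in stats) / len(stats) <= target:
            return cut
    return max(stats)


trials = simulate_null_posteriors()
print("basket-wise threshold:", calibrate(trials, familywise=False))
print("family-wise threshold:", calibrate(trials, familywise=True))
```

Under this rule the family-wise threshold can never fall below the basket-wise threshold, which is consistent with the more stringent criterion that family-wise control imposes.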

Family-wise Type I Error Control Simulation Results
Simulation results where the Bayesian design posterior probability thresholds were calibrated to maintain a 10% family-wise type I error rate are presented. The Bayesian designs show a decrease in statistical power given the more stringent criteria of family-wise type I error rate control, but the trade-off is a basket-wise type I error rate well below 10% across all thresholds and a nearly [...]

[Figure: Summary of scenario results, calibrated to maintain the family-wise type I error without interim monitoring.]

[Figure: Summary of scenario results for equally mixed scenario (5 null, 5 alternative baskets), calibrated to maintain the family-wise type I error without interim monitoring.]

In both figures, black coloring is for null baskets and gray coloring for alternative baskets. The dotted lines represent Simon's two-stage minimax design, dashed lines represent a design with interim monitoring after each participant based on Bayesian predictive probability futility monitoring, dashed-dotted lines represent Simon's (2016) design with information sharing with posterior probability futility monitoring, and the solid lines represent a Bayesian design that also facilitates information sharing across baskets based on exchangeability of the response rate with predictive probability futility monitoring. The rejection rate summarizes the proportion of baskets across the 1000 simulated trials where efficacy was concluded, the expected sample size presents the average number enrolled in a given null or alternative basket, and the stop rate describes the proportion of baskets that terminated early at any point for futility.